Snippet 1

Load in the data set from disk.

In [1]:
using DataFrames
df = readtable("data/01_heights_weights_genders.csv")
Out[1]:
GenderHeightWeight
1Male73.847017017515241.893563180437
2Male68.7819040458903162.3104725213
3Male74.1101053917849212.7408555565
4Male71.7309784033377220.042470303077
5Male69.8817958611153206.349800623871
6Male67.2530156878065152.212155757083
7Male68.7850812516616183.927888604031
8Male68.3485155115879167.971110489509
9Male67.018949662883175.92944039571
10Male63.4564939783664156.399676387112
11Male71.1953822829745186.604925560358
12Male71.6408051192206213.741169489411
13Male64.7663291334055167.127461073476
14Male69.2830700967204189.446181386738
15Male69.2437322298112186.434168021239
16Male67.6456197004212172.186930058117
17Male72.4183166259878196.028506330482
18Male63.974325721061172.88347020878
19Male69.6400598997523185.98395757313
20Male67.9360048540095182.426648013226
21Male67.9150501938206174.115929081393
22Male69.4394398680395197.73142161472
23Male66.1491319608781149.173566007975
24Male75.2059736142212228.761780615196
25Male67.8931963386043162.006651848287
26Male68.1440327982008192.343976579187
27Male69.0896314289256184.435174408406
28Male72.8008435165003206.828189420354
29Male67.4212422817167175.213922399227
30Male68.4964153568827154.342638925955

Create a numeric DataArray containing just the heights data.

In [2]:
heights = df[:Height]
Out[2]:
10000-element DataArrays.DataArray{Float64,1}:
 73.847 
 68.7819
 74.1101
 71.731 
 69.8818
 67.253 
 68.7851
 68.3485
 67.0189
 63.4565
 71.1954
 71.6408
 64.7663
  ⋮     
 59.5387
 60.9551
 63.1795
 62.6367
 62.0778
 60.0304
 59.0983
 66.1727
 67.0672
 63.868 
 69.0342
 61.9442
In [3]:
median(heights)
Out[3]:
66.31807008178464
In [4]:
mean(heights)
Out[4]:
66.36755975482124

Snippet 2

Define our own mean and median functions.

In [5]:
function mymean(x) 
    return (sum(x) / length(x))
end
Out[5]:
mymean (generic function with 1 method)
In [6]:
function mymedian(x)
    sorted = sort(x)
    if length(x) % 2 == 0
        indices = Array{Int64}(2) 
        indices = [Int64(length(x)/2), Int64(length(x)/2 + 1)]
        return (sorted[indices[1]] + sorted[indices[2]]) / 2
    else 
        index = ceil(length(x)/2)
        return mean(sorted[index])
    end
end
Out[6]:
mymedian (generic function with 1 method)

Snippet 3

Compare means and medians on toy examples.

In [7]:
array = [0,100]
Out[7]:
2-element Array{Int64,1}:
   0
 100
In [8]:
mean(array)
Out[8]:
50.0
In [9]:
median(array)
Out[9]:
50.0
In [10]:
array = [0, 0, 100]
Out[10]:
3-element Array{Int64,1}:
   0
   0
 100
In [11]:
mean(array)
Out[11]:
33.333333333333336
In [12]:
median(array)
Out[12]:
0.0

Snippet 4

Confirm that our mean and median functions produce the correct answer.

In [13]:
mymean(heights)
Out[13]:
66.36755975482124
In [14]:
mymedian(heights)
Out[14]:
66.31807008178464
In [15]:
mean(heights) - mymean(heights)
Out[15]:
0.0
In [16]:
median(heights) - mymedian(heights)
Out[16]:
0.0

Snippet 5

Experiment with functions for assessing the range of a data set.

In [17]:
minimum(heights)
Out[17]:
54.2631333250971

Snippet 6

In [18]:
maximum(heights)
Out[18]:
78.9987423463896

Snippet 7

In [19]:
extrema(heights)
Out[19]:
(54.2631333250971,78.9987423463896)

Snippet 8

In [20]:
quantile(heights, [0, 0.25, 0.5, 0.75, 1])
Out[20]:
5-element Array{Float64,1}:
 54.2631
 63.5056
 66.3181
 69.1743
 78.9987

Snippet 9

In [21]:
quantile(heights, collect(0:0.2:1))
Out[21]:
6-element Array{Float64,1}:
 54.2631
 62.859 
 65.1942
 67.4354
 69.8116
 78.9987

Snippet 10

In [22]:
collect(0:0.2:1)
Out[22]:
6-element Array{Float64,1}:
 0.0
 0.2
 0.4
 0.6
 0.8
 1.0

Snippet 11

Define a variance function to assess the spread of data.

In [23]:
function myvar(x)
    m = mean(x)
    return sum(map(val -> (val - m) ^ 2, x)) / length(x)
end
Out[23]:
myvar (generic function with 1 method)

Snippet 12

Test our variance function for correctness.

In [24]:
myvar(heights)
Out[24]:
14.801992292876758
In [25]:
var(heights)
Out[25]:
14.803472640140773

Snippet 13

Update the variance function to make it unbiased.

In [26]:
function myunbiasedvar(x)
    m = mean(x)
    return sum(map(val -> (val - m) ^ 2, x)) / (length(x) - 1)
end
Out[26]:
myunbiasedvar (generic function with 1 method)

Test our variance function again for correctness.

In [27]:
myunbiasedvar(heights)
Out[27]:
14.803472640140772
In [28]:
var(heights)
Out[28]:
14.803472640140773

Snippet 14

Check the range predicted by the variance function.

In [29]:
mean(heights) - var(heights), mean(heights) + var(heights)
Out[29]:
(51.56408711468047,81.17103239496201)

Snippet 15

In [30]:
mean(heights) - var(heights), mean(heights) + var(heights)
Out[30]:
(51.56408711468047,81.17103239496201)
In [31]:
extrema(heights)
Out[31]:
(54.2631333250971,78.9987423463896)

Snippet 16

Switch to standard deviations instead for thinking about ranges.

In [32]:
function mystd(x)
    return sqrt(myunbiasedvar(x))
end
Out[32]:
mystd (generic function with 1 method)

Snippet 17

Test our standard deviation function for correctness.

In [33]:
mystd(heights)
Out[33]:
3.8475281207732284
In [34]:
std(heights)
Out[34]:
3.847528120773229

Snippet 18

In [35]:
mean(heights) - std(heights), mean(heights) + std(heights)
Out[35]:
(62.52003163404802,70.21508787559448)
In [36]:
extrema(heights)
Out[36]:
(54.2631333250971,78.9987423463896)

Snippet 19

In [37]:
mean(heights) - std(heights), mean(heights) + std(heights)
Out[37]:
(62.52003163404802,70.21508787559448)
In [38]:
quantile(heights, 0.25), quantile(heights, 0.75)
Out[38]:
(63.505620481218955,69.1742617268347)

Snippet 20

Start visualizing data using the ggplot2 package.

In [39]:
using Gadfly
plot(df, x="Height", Geom.histogram(bincount=25))
Out[39]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 -1200 -1000 -800 -600 -400 -200 0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000

Snippet 21

In [40]:
plot(df, x="Height", Geom.histogram(bincount=5))
Out[40]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 -6.0×10³ -5.0×10³ -4.0×10³ -3.0×10³ -2.0×10³ -1.0×10³ 0 1.0×10³ 2.0×10³ 3.0×10³ 4.0×10³ 5.0×10³ 6.0×10³ 7.0×10³ 8.0×10³ 9.0×10³ 1.0×10⁴ 1.1×10⁴ -5.0×10³ -4.8×10³ -4.6×10³ -4.4×10³ -4.2×10³ -4.0×10³ -3.8×10³ -3.6×10³ -3.4×10³ -3.2×10³ -3.0×10³ -2.8×10³ -2.6×10³ -2.4×10³ -2.2×10³ -2.0×10³ -1.8×10³ -1.6×10³ -1.4×10³ -1.2×10³ -1.0×10³ -8.0×10² -6.0×10² -4.0×10² -2.0×10² 0 2.0×10² 4.0×10² 6.0×10² 8.0×10² 1.0×10³ 1.2×10³ 1.4×10³ 1.6×10³ 1.8×10³ 2.0×10³ 2.2×10³ 2.4×10³ 2.6×10³ 2.8×10³ 3.0×10³ 3.2×10³ 3.4×10³ 3.6×10³ 3.8×10³ 4.0×10³ 4.2×10³ 4.4×10³ 4.6×10³ 4.8×10³ 5.0×10³ 5.2×10³ 5.4×10³ 5.6×10³ 5.8×10³ 6.0×10³ 6.2×10³ 6.4×10³ 6.6×10³ 6.8×10³ 7.0×10³ 7.2×10³ 7.4×10³ 7.6×10³ 7.8×10³ 8.0×10³ 8.2×10³ 8.4×10³ 8.6×10³ 8.8×10³ 9.0×10³ 9.2×10³ 9.4×10³ 9.6×10³ 9.8×10³ 1.0×10⁴ -5×10³ 0 5×10³ 1×10⁴ -5.0×10³ -4.5×10³ -4.0×10³ -3.5×10³ -3.0×10³ -2.5×10³ -2.0×10³ -1.5×10³ -1.0×10³ -5.0×10² 0 5.0×10² 1.0×10³ 1.5×10³ 2.0×10³ 2.5×10³ 3.0×10³ 3.5×10³ 4.0×10³ 4.5×10³ 5.0×10³ 5.5×10³ 6.0×10³ 6.5×10³ 7.0×10³ 7.5×10³ 8.0×10³ 8.5×10³ 9.0×10³ 9.5×10³ 1.0×10⁴

Snippet 22

In [41]:
plot(df, x="Height", Geom.histogram(bincount=10000))
Out[41]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 -10 0 10 20 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0

Snippet 23

Experiment with kernel density estimates.

In [42]:
plot(df, x="Height", Geom.histogram)
Out[42]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -200 0 200 400 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400

Snippet 24

Separate out heights and weights based on gender.

In [43]:
plot(df, x="Height", y="Gender", Geom.density)
Out[43]:
Height 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 -0.1 0.0 0.1 0.2 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 Gender
In [44]:
plot(df, x="Height", y="Gender", color="Gender", Geom.density)
Out[44]:
Height 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 Male Female Gender -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 -0.155 -0.150 -0.145 -0.140 -0.135 -0.130 -0.125 -0.120 -0.115 -0.110 -0.105 -0.100 -0.095 -0.090 -0.085 -0.080 -0.075 -0.070 -0.065 -0.060 -0.055 -0.050 -0.045 -0.040 -0.035 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 0.060 0.065 0.070 0.075 0.080 0.085 0.090 0.095 0.100 0.105 0.110 0.115 0.120 0.125 0.130 0.135 0.140 0.145 0.150 0.155 0.160 0.165 0.170 0.175 0.180 0.185 0.190 0.195 0.200 0.205 0.210 0.215 0.220 0.225 0.230 0.235 0.240 0.245 0.250 0.255 0.260 0.265 0.270 0.275 0.280 0.285 0.290 0.295 0.300 0.305 -0.2 0.0 0.2 0.4 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 Gender

Snippet 25

In [45]:
plot(df, x="Weight", y="Gender", color="Gender", Geom.density)
Out[45]:
Weight -400 -300 -200 -100 0 100 200 300 400 500 600 700 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 -300 0 300 600 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 Male Female Gender -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 -0.025 -0.024 -0.023 -0.022 -0.021 -0.020 -0.019 -0.018 -0.017 -0.016 -0.015 -0.014 -0.013 -0.012 -0.011 -0.010 -0.009 -0.008 -0.007 -0.006 -0.005 -0.004 -0.003 -0.002 -0.001 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 0.036 0.037 0.038 0.039 0.040 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048 0.049 0.050 0.051 -0.05 0.00 0.05 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 0.038 0.040 0.042 0.044 0.046 0.048 0.050 0.052 Gender

Snippet 26

Produce two facets in a single plot to make it easier to see the hidden structure.

In [46]:
plot(df, ygroup="Gender" ,x="Weight", y="Gender", color="Gender", Geom.subplot_grid(Geom.density))
Out[46]:
Weight Male Female Gender -400 -300 -200 -100 0 100 200 300 400 500 600 700 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 -300 0 300 600 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 -0.025 -0.024 -0.023 -0.022 -0.021 -0.020 -0.019 -0.018 -0.017 -0.016 -0.015 -0.014 -0.013 -0.012 -0.011 -0.010 -0.009 -0.008 -0.007 -0.006 -0.005 -0.004 -0.003 -0.002 -0.001 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 0.036 0.037 0.038 0.039 0.040 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048 0.049 0.050 0.051 -0.05 0.00 0.05 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 0.038 0.040 0.042 0.044 0.046 0.048 0.050 0.052 Female -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055 -0.025 -0.024 -0.023 -0.022 -0.021 -0.020 -0.019 -0.018 -0.017 -0.016 -0.015 -0.014 -0.013 -0.012 -0.011 -0.010 -0.009 -0.008 -0.007 -0.006 -0.005 -0.004 -0.003 -0.002 -0.001 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 0.036 0.037 0.038 0.039 0.040 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048 0.049 0.050 0.051 -0.05 0.00 0.05 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 0.038 0.040 0.042 0.044 0.046 0.048 0.050 0.052 Male Gender by Gender

Snippet 27

Experiment with random numbers from the normal distribution.

In [47]:
using Distributions

plot(DataFrame(X = rand(Normal(0, 1), 10000) + rand_normal(0, 1)), x="X", Geom.density)
LoadError: UndefVarError: rand_normal not defined
while loading In[47], in expression starting on line 3

Snippet 28

Compare the normal distribution with the Cauchy distribution.

In [48]:
srand(1)
normal_values = rand(Normal(0, 1), 250)
cauchy_values = rand(Cauchy(0, 1), 250)
extrema(normal_values)
Out[48]:
(-2.748600587280482,2.2950878238373105)
In [49]:
extrema(cauchy_values)
Out[49]:
(-148.8706482930875,81.40805157069148)

Snippet 29

In [50]:
plot(DataFrame(X = normal_values), x="X", Geom.density, Guide.title("Normal distribution"))
Out[50]:
X -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 -12.0 -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 -20 -10 0 10 20 -12.0 -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 -0.40 -0.38 -0.36 -0.34 -0.32 -0.30 -0.28 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 0.64 0.66 0.68 0.70 0.72 0.74 0.76 0.78 0.80 0.82 -0.5 0.0 0.5 1.0 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Normal distribution
In [51]:
plot(DataFrame(X = cauchy_values), x="X", Geom.density, Guide.title("Cauchy distribution"))
Out[51]:
X -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 -500 -490 -480 -470 -460 -450 -440 -430 -420 -410 -400 -390 -380 -370 -360 -350 -340 -330 -320 -310 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 -500 0 500 -500 -480 -460 -440 -420 -400 -380 -360 -340 -320 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 -0.25 0.00 0.25 0.50 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 Cauchy distribution

Snippet 30

Experiment with random numbers from the gamma distribution.

In [52]:
gamma_values = rand(Gamma(1, 0.001), 100000)
plot(DataFrame(X = gamma_values), x="X", Geom.density, Guide.title("Gamma values"))
Out[52]:
X -0.030 -0.025 -0.020 -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020 0.025 0.030 0.035 0.040 -0.025 -0.024 -0.023 -0.022 -0.021 -0.020 -0.019 -0.018 -0.017 -0.016 -0.015 -0.014 -0.013 -0.012 -0.011 -0.010 -0.009 -0.008 -0.007 -0.006 -0.005 -0.004 -0.003 -0.002 -0.001 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 0.036 -0.04 -0.02 0.00 0.02 0.04 -0.026 -0.024 -0.022 -0.020 -0.018 -0.016 -0.014 -0.012 -0.010 -0.008 -0.006 -0.004 -0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 0.020 0.022 0.024 0.026 0.028 0.030 0.032 0.034 0.036 -1500 -1000 -500 0 500 1000 1500 2000 2500 -1000 -950 -900 -850 -800 -750 -700 -650 -600 -550 -500 -450 -400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 -1000 0 1000 2000 -1000 -900 -800 -700 -600 -500 -400 -300 -200 -100 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 Gamma values

Snippet 31

Generate scatterplots of the heights and weights to see their relationship.

In [53]:
plot(df, x="Height", y="Weight", Geom.point)
Out[53]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 -400 -300 -200 -100 0 100 200 300 400 500 600 700 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 -300 0 300 600 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 Weight

Snippet 32

Add a smooth shape that relates the two explicitly.

In [54]:
xmin, xmax = extrema(df[:Height])
ymin, ymax = extrema(df[:Weight])
plot(layer(df, x="Height", y="Weight", Geom.point, order = 1),
layer(x=[xmin, xmax],y=[ymin, ymax], Geom.point,Geom.line, Theme(default_color=colorant"purple"), order = 10))
Out[54]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 -400 -300 -200 -100 0 100 200 300 400 500 600 700 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 -300 0 300 600 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 Weight

Snippet 33

See how the smooth shape gets better with more data.

  • 20 points
In [55]:
plot(df[1:21,:], x="Height", y="Weight", Geom.point)
Out[55]:
Height 40 45 50 55 60 65 70 75 80 85 90 95 45.0 45.5 46.0 46.5 47.0 47.5 48.0 48.5 49.0 49.5 50.0 50.5 51.0 51.5 52.0 52.5 53.0 53.5 54.0 54.5 55.0 55.5 56.0 56.5 57.0 57.5 58.0 58.5 59.0 59.5 60.0 60.5 61.0 61.5 62.0 62.5 63.0 63.5 64.0 64.5 65.0 65.5 66.0 66.5 67.0 67.5 68.0 68.5 69.0 69.5 70.0 70.5 71.0 71.5 72.0 72.5 73.0 73.5 74.0 74.5 75.0 75.5 76.0 76.5 77.0 77.5 78.0 78.5 79.0 79.5 80.0 80.5 81.0 81.5 82.0 82.5 83.0 83.5 84.0 84.5 85.0 85.5 86.0 86.5 87.0 87.5 88.0 88.5 89.0 89.5 90.0 40 60 80 100 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 0 50 100 150 200 250 300 350 400 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245 250 255 260 265 270 275 280 285 290 295 300 305 310 315 320 325 330 335 340 345 350 0 100 200 300 400 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 Weight
  • 200 points
In [56]:
plot(df[1:201,:], x="Height", y="Weight", Geom.point)
Out[56]:
Height 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 40 60 80 100 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 -100 -50 0 50 100 150 200 250 300 350 400 450 -50 -45 -40 -35 -30 -25 -20 -15 -10 -5 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245 250 255 260 265 270 275 280 285 290 295 300 305 310 315 320 325 330 335 340 345 350 355 360 365 370 375 380 385 390 395 400 -200 0 200 400 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 Weight
  • 2000 points
In [57]:
plot(df[1:2001,:], x="Height", y="Weight", Geom.point)
Out[57]:
Height 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 0 30 60 90 120 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 -150 -100 -50 0 50 100 150 200 250 300 350 400 450 500 550 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 -200 0 200 400 600 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 Weight

Snippet 34

Visualize how gender depends on height and weight.

In [58]:
plot(df, x="Height", y="Weight", Geom.point, color="Gender")
Out[58]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 Male Female Gender -400 -300 -200 -100 0 100 200 300 400 500 600 700 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 -300 0 300 600 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 Weight

Snippet 35

In [59]:
xmin, xmax = extrema(df[:Height])
ymin, ymax = extrema(df[:Weight])
plot(layer(df, x="Height", y="Weight", Geom.point, color="Gender", order = 1),
layer(x=[xmin, xmax],y=[ymin, ymax], Geom.point,Geom.line, Theme(default_color=colorant"purple"), order = 10))
Out[59]:
Height 10 20 30 40 50 60 70 80 90 100 110 120 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 0 50 100 150 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 Male Female Gender -400 -300 -200 -100 0 100 200 300 400 500 600 700 -300 -290 -280 -270 -260 -250 -240 -230 -220 -210 -200 -190 -180 -170 -160 -150 -140 -130 -120 -110 -100 -90 -80 -70 -60 -50 -40 -30 -20 -10 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 -300 0 300 600 -300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 Weight